Wine Quality Prediction

Introduction

Abstract

Winemaking is an intricate process, with many chemical and physical variables influencing the quality and taste of the final product. According to Grand View Research, the United States wine market was estimated at $63.69 billion in 2021. As wine consumers ourselves, our team wanted to better understand the key drivers of wine quality using data-driven techniques.

Prior studies have examined the chemical properties of wines, but few have conducted extensive statistical analysis to relate these properties to overall quality ratings from wine experts. Our goal was to bridge this gap using exploratory data analysis and statistical modeling.

The analysis will shed light on the factors that contribute to the quality of both red and white wines. It will examine the various chemical components that make up each type of wine. These ingredients each have a distinct relationship with the resulting character of the red or white wine. Understanding these relationships is valuable for wine drinkers who have preferences for certain styles of wine.

Why we chose this topic

We believe that our efforts to decipher the complex relationship between wine quality and its chemical makeup will be beneficial to wine producers and distributors. This research can help winemakers develop wines that have the required characteristics. With better knowledge of how density, chlorides, and other chemical properties affect wine quality, they can refine their production methods so that they consistently meet and surpass consumer expectations.

Consumers can also benefit from this analysis. They can learn which chemical properties are associated with better quality and seek out those wines. They can also judge whether the price they are paying is commensurate with the quality.


Data Description

The dataset consists of two tables, one for red wine and one for white wine. Both tables contain the same variables, so a single data description covers both. The number of columns and rows in each table (type of wine) is given below:

Wine quality-red 12 columns x 1,599 rows

Wine quality-white 12 columns x 4,898 rows

Head
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
white 7.0 0.27 0.36 20.7 0.045 45 170 1.001 3.00 0.45 8.8 6
white 6.3 0.30 0.34 1.6 0.049 14 132 0.994 3.30 0.49 9.5 6
white 8.1 0.28 0.40 6.9 0.050 30 97 0.995 3.26 0.44 10.1 6
white 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19 0.40 9.9 6
white 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19 0.40 9.9 6

The different columns present are:

Categorical Variable

  1. Type - Colour of wine [Red or White]

Numerical Variables

  1. Fixed Acidity - Concentration of non-volatile tartaric acid in the wine

  2. Volatile Acidity - Concentration of volatile acetic acid in the wine.

  3. Citric Acid - Concentration of citric acid in the wine.

  4. Residual Sugar - Concentration of sugar remaining after the fermentation in the wine.

  5. Chlorides - Concentration of sodium chloride in the wine.

  6. Free Sulfur Dioxide - Concentration of free, gaseous sulfur dioxide in the wine.

  7. Total Sulfur Dioxide - Total concentration of sulfur dioxide in the wine.

  8. Density - Density of the wine.

  9. pH - Acidity of the wine.

  10. Sulphates - Concentration of potassium sulfate in the wine.

  11. Alcohol - Alcohol content of the wine.

Target Variable

  1. Quality - Wine quality score as assessed by experts.

We use quality as our base factor and have built our analysis around it. Quality scores in the data range from 3 to 9: 3 marks a bad quality wine, 9 a good quality wine, and everything in between is treated as average quality.


Research Questions - SMART QUESTIONS!

  1. Which regression model predicts the quality of wine with the highest accuracy?
  2. What is the optimal set of features needed for predicting wine quality?
  3. Does our model support our initial analysis?

The target variable for our research project is : QUALITY


Evolution of Questions

Before beginning our analysis as data scientists, we asked ourselves, based on our own experience of tasting wine, what would affect overall quality. We supposed that the amount of alcohol, the sweetness of the wine, or perhaps its acidity could be influential. With these thoughts in mind, we wanted to see whether our initial assumptions held or whether the analysis would reveal something more interesting, and to determine to what extent these factors affect quality.


Data Cleaning

The raw data is unclean: it contains many NA values and duplicate rows. The first step of any data science project is to clean the data.

Right now, we have 6497 rows and 13 variables.

Steps to clean the data:

  1. Remove NA values

  2. Remove duplicate values.

The clean data has 5295 rows and 13 variables. It is free from NA values and duplicates and now can be used for further exploratory data analysis.
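The two cleaning steps can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the code used for the report. The first five records reuse three columns from the Head table above; the record with a missing value and the function name `clean` are assumptions added for illustration.

```python
# Three columns taken from the Head table; the last record is a made-up
# example of a row with a missing value (hypothetical, not from the dataset).
rows = [
    {"type": "white", "alcohol": 8.8, "pH": 3.00, "quality": 6},
    {"type": "white", "alcohol": 9.5, "pH": 3.30, "quality": 6},
    {"type": "white", "alcohol": 10.1, "pH": 3.26, "quality": 6},
    {"type": "white", "alcohol": 9.9, "pH": 3.19, "quality": 6},
    {"type": "white", "alcohol": 9.9, "pH": 3.19, "quality": 6},  # exact duplicate
    {"type": "red", "alcohol": None, "pH": 3.51, "quality": 5},   # hypothetical NA
]


def clean(records):
    """Drop records with missing values, then drop exact duplicates."""
    seen = set()
    kept = []
    for rec in records:
        if any(value is None for value in rec.values()):
            continue  # step 1: remove NA values
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # step 2: remove duplicate rows
        seen.add(key)
        kept.append(rec)
    return kept


cleaned = clean(rows)
print(len(rows), "rows before cleaning,", len(cleaned), "after")
```

Duplicates are detected on the full record here, so two wines that differ in any single measurement are both kept.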


Summary Statistics

Now that we have clean data, let's look at the summary statistics of the dataset.

Table: Statistics summary.

type: character column, length 5295

variable               Min.     1st Qu.   Median    Mean      3rd Qu.   Max.
fixed.acidity          3.80     6.40      7.00      7.22      7.70      15.90
volatile.acidity       0.080    0.230     0.300     0.344     0.410     1.580
citric.acid            0.000    0.240     0.310     0.319     0.400     1.660
residual.sugar         0.6      1.8       2.7       5.1       7.5       65.8
chlorides              0.009    0.038     0.047     0.057     0.066     0.611
free.sulfur.dioxide    1        16        28        30        41        289
total.sulfur.dioxide   6        74        116       114       154       440
density                0.987    0.992     0.995     0.995     0.997     1.039
pH                     2.72     3.11      3.21      3.22      3.33      4.01
sulphates              0.220    0.430     0.510     0.533     0.600     2.000
alcohol                8.0      9.5       10.4      10.6      11.4      14.9
quality                3.0      5.0       6.0       5.8       6.0       9.0

A quick look at the summary table tells us that:

  1. One column is of character type: Type.

  2. The other twelve columns, including Quality, are numeric in nature.

  3. We can see that for most of our columns, the Mean > Median. This suggests that the data has a positive skew.

  4. One can see that most characteristics have some significant outliers, as the maximum value is much larger than the third quartile.
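The last two observations can be checked directly against the summary table. The sketch below is a Python illustration (not the report's code) using values copied from that table. Comparing the maximum with the upper Tukey fence, Q3 + 1.5 × IQR, is an assumption about what counts as a "significant outlier"; the report does not state its criterion.

```python
# (Q1, median, mean, Q3, max) copied from the summary table in this report.
summary = {
    "fixed.acidity": (6.40, 7.00, 7.22, 7.70, 15.90),
    "volatile.acidity": (0.230, 0.300, 0.344, 0.410, 1.580),
    "citric.acid": (0.240, 0.310, 0.319, 0.400, 1.660),
    "residual.sugar": (1.8, 2.7, 5.1, 7.5, 65.8),
    "chlorides": (0.038, 0.047, 0.057, 0.066, 0.611),
    "free.sulfur.dioxide": (16, 28, 30, 41, 289),
    "total.sulfur.dioxide": (74, 116, 114, 154, 440),
    "density": (0.992, 0.995, 0.995, 0.997, 1.039),
    "pH": (3.11, 3.21, 3.22, 3.33, 4.01),
    "sulphates": (0.430, 0.510, 0.533, 0.600, 2.000),
    "alcohol": (9.5, 10.4, 10.6, 11.4, 14.9),
    "quality": (5.0, 6.0, 5.8, 6.0, 9.0),
}


def mean_above_median(stats):
    """Rough hint of positive skew: the mean exceeds the median."""
    _q1, median, mean, _q3, _maximum = stats
    return mean > median


def max_beyond_upper_fence(stats, k=1.5):
    """True when the maximum lies above Q3 + k * IQR (Tukey's rule)."""
    q1, _median, _mean, q3, maximum = stats
    return maximum > q3 + k * (q3 - q1)


skewed = [name for name, s in summary.items() if mean_above_median(s)]
flagged = [name for name, s in summary.items() if max_beyond_upper_fence(s)]
print(len(skewed), "of", len(summary), "variables have mean > median")
print(len(flagged), "of", len(summary), "have a maximum above the upper fence")
```

Because the table values are rounded, ties such as density (mean and median both 0.995) are counted as not skewed.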

Our next step is to visualize these using plots.


Exploratory Data Analysis (EDA)

Univariate Analysis

1. Quality

Observations

  1. Wine quality shows a rather symmetrical distribution.

  2. Most wines have a quality score of 6.

  3. No wine achieved the highest score of 10 and the worst wines got a rating of 3.

2. Acidity

Observations

  1. Looking at the acidity parameters in boxplots gives a similar picture. One can see the long positive tails of the fixed and volatile acid concentrations.

  2. Citric Acid has a bimodal distribution.

  3. The pH level of the wines is approximately normally distributed, with a median of 3.2.

3. Sulphates

Observations

  1. Free sulfur dioxide concentration is narrowly centered around 30 mg/L.

  2. Total sulfur dioxide concentration shows signs of bimodality with peaks around 20 and 120 mg/L.

  3. Most wines have a sulphate concentration around 0.5 g/L. Two small outlier groups around 1.6 and 1.9 g/L can be seen in the boxplot.

4. Sugar, Alcohol, Density, Chlorides Plots

Observations

  1. Generally, the wines in the data set appear to have low residual sugar concentrations. The positive skew moves the mean value (5.1 in the summary table) above the median (2.7). An extreme outlier can be found around 65 g/L residual sugar.

  2. The density parameter shows a very narrow distribution with low variation. One can see a few outliers around 1.01 and 1.04 g/cm3 but most wines have a density between 0.99 and 1.00 g/cm3.

  3. The histogram of chloride concentration has two distinct main peaks. The most frequent chloride concentrations are found around 0.04 g/L, and a second peak appears at about 0.08 g/L. The distribution has a very long tail in the positive direction, with outliers up to 0.6 g/L.

  4. The alcohol content of the wines in the data set ranges between 8 and 15 vol%. The median lies around 10 vol%. The distribution is rather wide and shows positive skewing.


Correlation Matrix

Before we go into further detailed analysis, we want to see which variables are most strongly associated with the quality of the wines. For this we computed a correlation matrix.

Our target variable is Quality and hence, we will explore the variables that are most correlated with Quality.

From the correlation matrix we can see that the 4 most correlated variables are:

  1. Alcohol

  2. Density

  3. Chlorides

  4. Citric Acid
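Each cell of a correlation matrix is a Pearson correlation coefficient between two columns. The following Python sketch shows the calculation for one pair. It uses six distinct rows that appear in this report's tables, which is far too few for a meaningful estimate, so the result only illustrates the mechanics; the function name `pearson` is an assumption for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Alcohol and density of six white wines shown in this report's tables.
alcohol = [8.8, 9.5, 10.1, 9.9, 9.6, 11.0]
density = [1.001, 0.994, 0.995, 0.996, 0.995, 0.994]

r = pearson(alcohol, density)
print("alcohol vs density, r =", round(r, 2))
```

On these rows the coefficient is negative, in line with the physical expectation that wines with more alcohol are less dense.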


Bivariate Analysis

The correlation plot shows that alcohol content, density, citric acid and chlorides are the variables most strongly associated with quality. Let us see how each relates to quality, and compare red and white wine individually.

1. Alcohol vs Quality

Observations

  1. White Wines have higher alcohol content.

  2. Alcohol has a strong positive correlation with quality.

  3. The boxplot shows that wines with higher quality seem to have higher alcohol content.

2. Density vs Quality

Observations:

  1. Red Wines are more dense than white wines.

  2. Density has a negative correlation with quality.

  3. The boxplot shows that wines with higher quality seem to be less dense.

3. Chlorides vs Quality

Observations:

  1. Red Wines have more chloride concentration than white wines.

  2. Chloride Concentration has a slight negative correlation with quality

  3. The boxplot shows that wines with higher quality seem to have less chlorides.

4. Citric Acid vs Quality

Observations:

  1. White Wines have more citric acid concentration than red wines.

  2. Citric Acid Concentration has a slight positive correlation with quality.

  3. There isn’t much difference in citric acid concentration in white wines across the quality ratings.

  4. The boxplot shows that wines with higher quality seem to have a higher citric acid concentration.


Multivariate Analysis

For the last part of our EDA, we use some multivariate plots to see how the remaining, less important features are distributed in red and white wine.

Observations:

  1. White Wines have more sugar concentration than red wines. This might explain why white wines are usually sweeter.

  2. Red wines contain high chloride and sulphate concentrations.


Statistical Tests

We have performed a few statistical tests to support our analysis.

1. Alcohol vs Quality

WELCH TWO SAMPLE T-TEST:

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in alcohol content.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the two wines.

We found a p-value of 3.606 × 10^{-6}, which is less than 0.05, so we reject the null hypothesis and conclude that there is a significant difference in mean alcohol content between red and white wine.
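For readers unfamiliar with the Welch test, the sketch below computes its t statistic and the Welch-Satterthwaite degrees of freedom in Python. The two samples are hypothetical values invented for illustration, not data from the report, and the p-value step is omitted because it needs a t-distribution that the Python standard library does not provide.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical alcohol values in vol%, NOT taken from the dataset.
red = [9.4, 9.8, 10.0, 10.5, 9.3]
white = [10.1, 11.0, 10.8, 11.4, 10.2]

t_stat, dof = welch_t(red, white)
print("t =", round(t_stat, 2), "df =", round(dof, 1))
```

Unlike the pooled-variance t-test, this version does not assume that red and white wines have equal variances, which is why the degrees of freedom are generally not an integer.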

ANOVA TEST FOR RED AND WHITE WINE:

NULL HYPOTHESIS (H0): There is no significant difference in mean alcohol content across quality categories in red wine/white wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean alcohol content across quality categories in red wine/white wine

We carried out separate tests for each wine and found that the p-values for both red and white wine are less than 0.05. We therefore reject the null hypothesis and conclude that mean alcohol content differs significantly across quality categories in both wines.
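The ANOVA F statistic compares the variation between quality categories with the variation within them. Here is a minimal Python sketch of the calculation; the three groups are hypothetical alcohol values chosen so that the arithmetic is easy to follow, and `one_way_anova_f` is an illustrative name, not a function used in the report.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical alcohol values (vol%) for three quality categories.
alcohol_by_quality = [
    [9.0, 10.0, 11.0],   # e.g. quality 5
    [10.0, 11.0, 12.0],  # e.g. quality 6
    [11.0, 12.0, 13.0],  # e.g. quality 7
]

f_stat, df_b, df_w = one_way_anova_f(alcohol_by_quality)
print("F =", f_stat, "on", df_b, "and", df_w, "degrees of freedom")
```

A significant F only says that at least one category mean differs from the others; it does not say which ones.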

2. Density vs Quality

WELCH TWO SAMPLE T-TEST:

NULL HYPOTHESIS (H0): There is no substantial mean density difference between red and white wine.

ALTERNATE HYPOTHESIS (H1): The two wines have a considerable mean difference.

We found a p-value of 7 × 10^{-322}, which is less than 0.05, so we reject the null hypothesis and conclude that there is a significant difference in mean density between red and white wine.

RED AND WHITE WINE ANOVA TEST:

NULL HYPOTHESIS (H0): There is no significant difference in mean density across quality categories in red wine/white wine.

ALTERNATE HYPOTHESIS (H1): There is a significant difference in mean density across quality categories in red wine/white wine.

We carried out separate ANOVA tests for each wine and found that the p-values for both red and white wine are less than 0.05. We therefore reject the null hypothesis and conclude that mean density differs significantly across quality categories in both wines.

3. Chloride vs Quality

WELCH TWO SAMPLE T-TEST:

NULL HYPOTHESIS (H0): There is no significant mean difference in chloride concentration between red and white wine.

ALTERNATE HYPOTHESIS (H1): There is a significant mean difference between the two wines.

We found a p-value of 3.706 × 10^{-159}, which is less than 0.05, so we reject the null hypothesis and conclude that there is a significant difference in mean chloride concentration between red and white wine.

ANOVA TEST FOR RED AND WHITE WINE:

NULL HYPOTHESIS (H0): There is no significant difference in mean chloride concentration across quality categories in red wine/white wine.

ALTERNATE HYPOTHESIS (H1): There is a significant difference in mean chloride concentration across quality categories in red wine/white wine.

We carried out separate ANOVA tests for each wine and found that the p-values for both red and white wine are less than 0.05. We therefore reject the null hypothesis and conclude that mean chloride concentration differs significantly across quality categories in both wines.

4. Citric acid vs Quality

WELCH TWO SAMPLE T-TEST:

NULL HYPOTHESIS (H0): There is no significant mean difference in citric acid content between red and white wine.

ALTERNATE HYPOTHESIS (H1): There is a significant mean difference between the two wines.

We found a p-value of 1.335 × 10^{-26}, which is less than 0.05, so we reject the null hypothesis and conclude that there is a significant difference in mean citric acid concentration between red and white wine.

ANOVA TEST FOR RED AND WHITE WINE:

NULL HYPOTHESIS (H0): There is no significant difference in mean citric acid content across quality categories in red wine/white wine.

ALTERNATE HYPOTHESIS (H1): There is a significant difference in mean citric acid content across quality categories in red wine/white wine.

We carried out separate tests for each wine and found that the p-value for red wine is less than 0.05, whereas for white wine it is greater than 0.05. Hence we reject the null hypothesis for red wine only and conclude that mean citric acid content differs significantly across quality categories in red wine.

Modeling

1. Linear Regression

Wine quality is consistently predicted by alcohol in all plots; models containing this variable have higher adjusted R², lower BIC, higher R², and lower Cp.
## 
## Call:
## lm(formula = quality ~ volatile.acidity + residual.sugar + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = numeric_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.992 -0.450 -0.024  0.462  3.144 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.391874   0.240127    5.80  7.2e-09 ***
## volatile.acidity     -1.390559   0.068586  -20.27  < 2e-16 ***
## residual.sugar        0.017460   0.002675    6.53  7.3e-11 ***
## free.sulfur.dioxide   0.006692   0.000825    8.11  6.4e-16 ***
## total.sulfur.dioxide -0.002224   0.000285   -7.81  6.7e-15 ***
## pH                    0.273930   0.066893    4.10  4.3e-05 ***
## sulphates             0.620295   0.071304    8.70  < 2e-16 ***
## alcohol               0.344362   0.009197   37.44  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.734 on 5287 degrees of freedom
## Multiple R-squared:  0.305,  Adjusted R-squared:  0.304 
## F-statistic:  331 on 7 and 5287 DF,  p-value: <2e-16
## [1] 0.728
## MSE 0.538
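Observations

  1. Every retained predictor is statistically significant; alcohol (t = 37.44) and volatile acidity (t = -20.27) show the strongest associations with wine quality.

  2. The model explains about 30% of the variance in quality (R² = 0.305), so most of the variation remains unexplained.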
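To make the fitted equation concrete, the Python sketch below applies the coefficients printed in the `lm()` summary to the first white wine from the Head table. It is an illustration of how a prediction is formed, not the report's code; `predict_quality` is a name chosen for this example.

```python
# Coefficients copied from the lm() summary printed in this report.
INTERCEPT = 1.391874
COEFFICIENTS = {
    "volatile.acidity": -1.390559,
    "residual.sugar": 0.017460,
    "free.sulfur.dioxide": 0.006692,
    "total.sulfur.dioxide": -0.002224,
    "pH": 0.273930,
    "sulphates": 0.620295,
    "alcohol": 0.344362,
}


def predict_quality(wine):
    """Linear predictor: intercept plus the sum of coefficient * value."""
    return INTERCEPT + sum(beta * wine[name] for name, beta in COEFFICIENTS.items())


# First row of the Head table (a white wine with observed quality 6).
first_white = {
    "volatile.acidity": 0.27,
    "residual.sugar": 20.7,
    "free.sulfur.dioxide": 45,
    "total.sulfur.dioxide": 170,
    "pH": 3.00,
    "sulphates": 0.45,
    "alcohol": 8.8,
    "quality": 6,
}

predicted = predict_quality(first_white)
residual = first_white["quality"] - predicted
print("predicted:", round(predicted, 2), "residual:", round(residual, 2))
```

For this wine the model predicts a little under 5.5 against an observed score of 6, a residual of the same order as the residual standard error of 0.734.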

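2. Logistic Regression

We need a categorical outcome variable for logistic regression. We therefore label wines with a quality score below 6 as "Bad Wines" (quality_category = 0) and wines with a score of 6 or higher as "Good Wines" (quality_category = 1), as the table below shows for wines of quality 6.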

type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality quality_category
1 white 7.0 0.27 0.36 20.7 0.045 45 170 1.001 3.00 0.45 8.8 6 1
2 white 6.3 0.30 0.34 1.6 0.049 14 132 0.994 3.30 0.49 9.5 6 1
3 white 8.1 0.28 0.40 6.9 0.050 30 97 0.995 3.26 0.44 10.1 6 1
4 white 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19 0.40 9.9 6 1
7 white 6.2 0.32 0.16 7.0 0.045 30 136 0.995 3.18 0.47 9.6 6 1
10 white 8.1 0.22 0.43 1.5 0.044 28 129 0.994 3.22 0.45 11.0 6 1
## 
## Call:
## glm(formula = quality_category ~ volatile.acidity + residual.sugar + 
##     pH + sulphates + alcohol + free.sulfur.dioxide + total.sulfur.dioxide, 
##     family = binomial(link = "logit"), data = wine)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.17e+01   8.29e-01  -14.07  < 2e-16 ***
## volatile.acidity     -4.28e+00   2.45e-01  -17.45  < 2e-16 ***
## residual.sugar        5.22e-02   8.64e-03    6.04  1.5e-09 ***
## pH                    8.50e-01   2.24e-01    3.80  0.00015 ***
## sulphates             1.89e+00   2.48e-01    7.64  2.1e-14 ***
## alcohol               9.57e-01   3.71e-02   25.78  < 2e-16 ***
## free.sulfur.dioxide   1.84e-02   2.80e-03    6.59  4.5e-11 ***
## total.sulfur.dioxide -7.04e-03   9.27e-04   -7.59  3.1e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6999.2  on 5294  degrees of freedom
## Residual deviance: 5444.6  on 5287  degrees of freedom
## AIC: 5461
## 
## Number of Fisher Scoring iterations: 4
##          (Intercept)     volatile.acidity       residual.sugar 
##             8.53e-06             1.39e-02             1.05e+00 
##                   pH            sulphates              alcohol 
##             2.34e+00             6.65e+00             2.60e+00 
##  free.sulfur.dioxide total.sulfur.dioxide 
##             1.02e+00             9.93e-01
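The last block of output above appears to be the exponentiated coefficients, that is, odds ratios. The Python sketch below reproduces them from the rounded estimates in the coefficient table and then computes a predicted probability for the first white wine in the table. Because the printed coefficients are rounded (the intercept in particular), the results are approximate, and this is an illustration rather than the report's code.

```python
import math

# Rounded estimates copied from the glm() coefficient table in this report.
INTERCEPT = -11.7
COEFFICIENTS = {
    "volatile.acidity": -4.28,
    "residual.sugar": 0.0522,
    "pH": 0.850,
    "sulphates": 1.89,
    "alcohol": 0.957,
    "free.sulfur.dioxide": 0.0184,
    "total.sulfur.dioxide": -0.00704,
}

# exp(coefficient) is the multiplicative change in the odds of a "good"
# wine for a one-unit increase in the predictor, other predictors fixed.
odds_ratios = {name: math.exp(beta) for name, beta in COEFFICIENTS.items()}


def predict_probability(wine):
    """Inverse logit of the linear predictor."""
    logit = INTERCEPT + sum(beta * wine[name] for name, beta in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-logit))


# Row 1 of the table above (white wine, quality 6, quality_category 1).
first_white = {
    "volatile.acidity": 0.27,
    "residual.sugar": 20.7,
    "pH": 3.00,
    "sulphates": 0.45,
    "alcohol": 8.8,
    "free.sulfur.dioxide": 45,
    "total.sulfur.dioxide": 170,
}

CUTOFF = 0.6  # the cutoff quoted for the confusion matrix in this report
probability = predict_probability(first_white)
predicted_class = 1 if probability >= CUTOFF else 0
print("odds ratio for alcohol:", round(odds_ratios["alcohol"], 2))
print("P(good) for row 1:", round(probability, 2), "-> class", predicted_class)
```

With these rounded coefficients, each additional vol% of alcohol multiplies the odds of a good wine by about 2.6, while this particular low-alcohol wine falls below the 0.6 cutoff.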

Observations

Using chemical predictors, a logistic regression model was employed to forecast a binary quality category for wine. All of the predictors have p-values less than 0.05, indicating statistical significance. The likelihood of a higher quality category is positively impacted by certain predictors but negatively impacted by volatile acidity and total sulfur dioxide. The substantial difference between the null and residual deviances indicates that the model fits considerably better than an intercept-only model.
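The deviance comparison can be worked through numerically. The Python sketch below uses the two deviances and their degrees of freedom from the `glm()` output; the critical value is the standard chi-squared table value for 7 degrees of freedom at the 5% level, and the pseudo R² formula assumes an ungrouped binary response, where deviance equals -2 times the log-likelihood.

```python
# Values copied from the glm() output in this report.
null_deviance, null_df = 6999.2, 5294
residual_deviance, residual_df = 5444.6, 5287

# Likelihood-ratio statistic: the drop in deviance gained by adding the
# seven predictors to an intercept-only model.
lr_statistic = null_deviance - residual_deviance
lr_df = null_df - residual_df

# Chi-squared critical value for 7 degrees of freedom at alpha = 0.05
# (standard table value).
CHI2_CRITICAL_7DF = 14.07
better_than_null = lr_statistic > CHI2_CRITICAL_7DF

# McFadden's pseudo R-squared.
mcfadden_r2 = 1 - residual_deviance / null_deviance

print("LR statistic:", round(lr_statistic, 1), "on", lr_df, "df")
print("McFadden pseudo R2:", round(mcfadden_r2, 3))
```

The pseudo R² obtained this way matches the McFadden value of 0.222 reported further down in the output.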

Confusion Matrix: Logit model, cutoff = 0.6

            Predicted 0   Predicted 1   Total
Actual 0           1185           794    1979
Actual 1            530          2786    3316
Total              1715          3580    5295

Observations

  1. The model shows a respectable degree of predictive ability: correct predictions (1185 true negatives and 2786 true positives) outnumber the errors (794 false positives and 530 false negatives).

  2. However, there are comparatively many false positives, which could be a problem if wrongly labelling a wine as good is costly in the intended application.

## Area under the curve: 0.808

Observations

The ROC curve plots sensitivity (true positive rate) against 1 - specificity (false positive rate) at different threshold values. The closer the curve follows the top and left borders of the ROC space, the more accurate the classifier. The area under the curve (AUC) summarizes the model's ability to discriminate between the two classes. The curve lies toward the upper left corner and the AUC is 0.808, indicating that the model's discriminative power is good.
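One way to read the AUC is as the probability that a randomly chosen good wine receives a higher predicted score than a randomly chosen bad wine. The Python sketch below computes it by direct pair counting on a tiny hypothetical example; the labels and scores are invented for illustration, and `auc_by_pairs` is not a function from the report.

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = good wine) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

auc = auc_by_pairs(labels, scores)
print("AUC =", auc)
```

Pair counting is quadratic in the sample size, so it is suited to illustration only; for thousands of wines a rank-based computation is the usual choice.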

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  wine$quality_category, fitted(winelogit1)
## X-squared = 11, df = 8, p-value = 0.2

Observation

The Hosmer-Lemeshow goodness of fit test for the logistic regression model yields a chi-squared statistic of 11 with 8 degrees of freedom and a p-value of 0.2. Since the p-value is above 0.05, there is no evidence of poor fit, which suggests that the model fits the data adequately.

## 'log Lik.' 0.222 (df=8)
## fitting null model for pseudo-r2
##       llh   llhNull        G2  McFadden      r2ML      r2CU 
## -2722.313 -3499.575  1554.526     0.222     0.254     0.347
## Accuracy of logistic regression model is 0.75
##  Precision of logistic regression model is 0.778

Observations

  1. With an accuracy of 75%, the logistic regression model assigns the correct quality category to 75% of the wines.

  2. With a precision of 77.8%, the model is right about 77.8% of the time when it predicts a positive result (a good wine).
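Both figures follow directly from the confusion matrix reported earlier (cutoff = 0.6). The Python sketch below recomputes them, together with recall and specificity, which the report does not quote but which come from the same four counts.

```python
# Counts copied from the confusion matrix in this report (cutoff = 0.6).
true_negative, false_positive = 1185, 794   # actual 0
false_negative, true_positive = 530, 2786   # actual 1

total = true_negative + false_positive + false_negative + true_positive

accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)        # sensitivity
specificity = true_negative / (true_negative + false_positive)

print("accuracy:   ", round(accuracy, 3))
print("precision:  ", round(precision, 3))
print("recall:     ", round(recall, 3))
print("specificity:", round(specificity, 3))
```

The gap between recall and specificity reflects the relatively large number of false positives noted above: the model finds most good wines but also passes many bad ones.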

KNN Model

Decision Tree

## Call:
## rpart(formula = quality ~ fixed.acidity + volatile.acidity + 
##     residual.sugar + free.sulfur.dioxide + density + pH + sulphates + 
##     alcohol, data = trainData)
##   n= 4765 
## 
##       CP nsplit rel error xerror   xstd
## 1 0.1665      0     1.000  1.000 0.0220
## 2 0.0338      1     0.834  0.844 0.0207
## 3 0.0321      2     0.800  0.811 0.0206
## 4 0.0136      3     0.768  0.789 0.0199
## 5 0.0100      4     0.754  0.783 0.0197
## 
## Variable importance
##             alcohol             density    volatile.acidity           sulphates 
##                  49                  25                  12                   5 
##       fixed.acidity      residual.sugar free.sulfur.dioxide                  pH 
##                   4                   2                   2                   1 
## 
## Node number 1: 4765 observations,    complexity param=0.166
##   mean=5.8, MSE=0.781 
##   left son=2 (2767 obs) right son=3 (1998 obs)
##   Primary splits:
##       alcohol             < 10.6  to the left,  improve=0.1660, (0 missing)
##       density             < 0.992 to the right, improve=0.1030, (0 missing)
##       volatile.acidity    < 0.458 to the right, improve=0.0519, (0 missing)
##       free.sulfur.dioxide < 18.5  to the left,  improve=0.0223, (0 missing)
##       sulphates           < 0.685 to the left,  improve=0.0132, (0 missing)
##   Surrogate splits:
##       density        < 0.993 to the right, agree=0.790, adj=0.500, (0 split)
##       sulphates      < 0.405 to the right, agree=0.607, adj=0.062, (0 split)
##       fixed.acidity  < 6.05  to the right, agree=0.603, adj=0.054, (0 split)
##       residual.sugar < 5.28  to the right, agree=0.586, adj=0.014, (0 split)
##       pH             < 3.54  to the left,  agree=0.583, adj=0.007, (0 split)
## 
## Node number 2: 2767 observations,    complexity param=0.0338
##   mean=5.49, MSE=0.553 
##   left son=4 (1498 obs) right son=5 (1269 obs)
##   Primary splits:
##       volatile.acidity    < 0.282 to the right, improve=0.08210, (0 missing)
##       alcohol             < 10.1  to the left,  improve=0.03980, (0 missing)
##       free.sulfur.dioxide < 24.5  to the left,  improve=0.02160, (0 missing)
##       density             < 0.995 to the right, improve=0.00820, (0 missing)
##       sulphates           < 0.675 to the left,  improve=0.00746, (0 missing)
##   Surrogate splits:
##       free.sulfur.dioxide < 29.5  to the left,  agree=0.654, adj=0.246, (0 split)
##       density             < 0.995 to the right, agree=0.634, adj=0.203, (0 split)
##       sulphates           < 0.505 to the right, agree=0.628, adj=0.190, (0 split)
##       residual.sugar      < 4.15  to the left,  agree=0.625, adj=0.182, (0 split)
##       fixed.acidity       < 6.85  to the right, agree=0.598, adj=0.124, (0 split)
## 
## Node number 3: 1998 observations,    complexity param=0.0321
##   mean=6.22, MSE=0.786 
##   left son=6 (1100 obs) right son=7 (898 obs)
##   Primary splits:
##       alcohol             < 11.7  to the left,  improve=0.0760, (0 missing)
##       free.sulfur.dioxide < 11.5  to the left,  improve=0.0442, (0 missing)
##       volatile.acidity    < 0.587 to the right, improve=0.0405, (0 missing)
##       density             < 0.991 to the right, improve=0.0311, (0 missing)
##       fixed.acidity       < 7.05  to the right, improve=0.0117, (0 missing)
##   Surrogate splits:
##       density          < 0.991 to the right, agree=0.712, adj=0.359, (0 split)
##       volatile.acidity < 0.282 to the left,  agree=0.582, adj=0.069, (0 split)
##       sulphates        < 0.365 to the right, agree=0.573, adj=0.050, (0 split)
##       fixed.acidity    < 5.95  to the right, agree=0.571, adj=0.046, (0 split)
##       residual.sugar   < 13.6  to the left,  agree=0.553, adj=0.006, (0 split)
## 
## Node number 4: 1498 observations
##   mean=5.3, MSE=0.443 
## 
## Node number 5: 1269 observations
##   mean=5.72, MSE=0.583 
## 
## Node number 6: 1100 observations,    complexity param=0.0136
##   mean=6, MSE=0.768 
##   left son=12 (174 obs) right son=13 (926 obs)
##   Primary splits:
##       volatile.acidity    < 0.455 to the right, improve=0.0600, (0 missing)
##       free.sulfur.dioxide < 11.5  to the left,  improve=0.0533, (0 missing)
##       pH                  < 3.5   to the right, improve=0.0157, (0 missing)
##       sulphates           < 0.675 to the left,  improve=0.0154, (0 missing)
##       fixed.acidity       < 7.05  to the right, improve=0.0129, (0 missing)
##   Surrogate splits:
##       pH            < 3.5  to the right, agree=0.853, adj=0.069, (0 split)
##       density       < 1    to the right, agree=0.847, adj=0.034, (0 split)
##       fixed.acidity < 14.9 to the right, agree=0.844, adj=0.011, (0 split)
## 
## Node number 7: 898 observations
##   mean=6.49, MSE=0.675 
## 
## Node number 12: 174 observations
##   mean=5.51, MSE=0.79 
## 
## Node number 13: 926 observations
##   mean=6.09, MSE=0.709 
## 
## n= 4765 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 4765 3720 5.80  
##    2) alcohol< 10.6 2767 1530 5.49  
##      4) volatile.acidity>=0.282 1498 664 5.30 *
##      5) volatile.acidity< 0.282 1269 740 5.72 *
##    3) alcohol>=10.6 1998 1570 6.22  
##      6) alcohol< 11.7 1100 845 6.00  
##       12) volatile.acidity>=0.455 174 137 5.51 *
##       13) volatile.acidity< 0.455 926 657 6.09 *
##      7) alcohol>=11.7 898 606 6.49 *
Observations

The tree was fit on 4765 training observations. The printed structure shows the decision paths and outcomes, with terminal nodes marked by an asterisk. Alcohol is the first split and the most important variable (importance 49), and higher alcohol levels lead to higher predicted quality scores.
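Because the fitted tree has only five terminal nodes, it can be written out as a handful of nested conditions. The Python sketch below transcribes the splits and leaf means from the printed tree. The thresholds are the rounded values shown in the output, so wines very close to a split point may be assigned differently from the original model, and `predict_quality_tree` is a name chosen for this illustration.

```python
def predict_quality_tree(alcohol, volatile_acidity):
    """Predicted quality, following the splits of the printed rpart tree."""
    if alcohol < 10.6:
        # Node 2: lower-alcohol wines, split on volatile acidity.
        return 5.30 if volatile_acidity >= 0.282 else 5.72
    if alcohol < 11.7:
        # Node 6: medium-alcohol wines, split on volatile acidity.
        return 5.51 if volatile_acidity >= 0.455 else 6.09
    # Node 7: wines with at least 11.7 vol% alcohol.
    return 6.49


# First white wine from the Head table: alcohol 8.8, volatile acidity 0.27.
print(predict_quality_tree(8.8, 0.27))
# A hypothetical high-alcohol wine.
print(predict_quality_tree(12.5, 0.30))
```

Only alcohol and volatile acidity appear in the final splits; the other variables listed under variable importance contribute through surrogate splits.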

Results

  1. Wines with higher alcohol content, increased citric acid levels, lower density, and less chlorides tend to exhibit higher quality.

  2. It appears that white wines, in general, tend to be sweeter and have higher alcohol content when compared to their red counterparts.

  3. Red Wines have more concentration of sulphates and chlorides.

  4. Logistic Regression was the best model to predict wine quality.

  5. From feature selection we found out that Volatile Acidity, Residual Sugar, Sulphur Concentration, Alcohol, and pH are the most important attributes while predicting wine quality.

  6. Higher Alcohol content, higher citric acid, less dense, and less chlorides make for better wine.

  7. White wines are sweeter than red wines because of higher sugar content.

  8. Initially we found out that citric acid and chlorides influence the quality but, after modelling, we found out that volatile acidity and sulphur concentration influence it more.


Limitations and Future scope

Although we thoroughly believe in our analysis, we have to mention a few anomalies that are present that may or may not have influenced the results.

  1. We have considerably more data on white wine than on red wine.

  2. We have also observed that most of the data fall in the average quality range, i.e., from 4 to 8.

  3. Classification algorithms are also available to group similar quality wines.

A MORE BALANCED SET OF DATA WILL IMPROVE OUR ANALYSIS.


Conclusions

Through our exploration of wine quality, we have come across some discoveries that challenge common winemaking beliefs. While alcohol content, sugar levels and acidity are known factors, we have found that attributes like density and chlorides also play a role in shaping the quality of wine.

Moreover, we have examined the contrast between red and white wines, going beyond mere color differences to uncover deeper variations in chemical composition. Understanding these distinctions helps wine enthusiasts and consumers make choices based on their personal preferences and specific occasions. Our findings are supported by statistical tests: t-tests to check whether each factor differs significantly between red and white wine, and ANOVA tests to see how these factors vary across quality ratings within red and white wine. Together they provide a framework for understanding the interplay of variables that influence wine quality.

Equipped with this knowledge let us raise our glasses to the fusion of data science and wine making artistry. We also hope that one uses this knowledge to buy great wines.


Additional Insights

Even if our present investigation has been fascinating, we understand that the quest to predict wine quality is a dynamic endeavour. Our commitment to gaining deeper insights is demonstrated by our efforts to include a wider range of attributes that affect wine quality to our dataset. This growth is a calculated step towards a more comprehensive knowledge of the complex process of wine making, not just an increase in data. We want to build a predictive model that goes into the nuances of winemaking and captures every detail that contributes to wine greatness by incorporating a wider variety of variables.

We hope to improve our predictive model in this ongoing investigation to deliver even more thorough and accurate results.


References

We have included necessary citation for the data set from:

  1. Cortez, P., Teixeira, J., Cerdeira, A., Almeida, F., Matos, T., Reis, J. (2009). Using Data Mining for Wine Quality Assessment. In: Gama, J., Costa, V.S., Jorge, A.M., Brazdil, P.B. (eds) Discovery Science. DS 2009. Lecture Notes in Computer Science, vol 5808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04747-3_8

  2. https://www.kaggle.com/datasets/yasserh/wine-quality-dataset/data

  3. https://link.springer.com/chapter/10.1007/978-3-642-04747-3_8